Abstract: Regression bugs occur whenever software functionality
that previously worked as desired stops working, or no
longer works as expected. Code changes, such as bug fixes or new
feature work, may result in a regression bug. Regression bugs are
an annoying and painful phenomenon in the software development
process, requiring a great deal of effort to localize and effectively
hindering team progress. In this paper we present Regression
Detective, a method that assists the developer in locating the source
code segments that caused a given regression bug. Unlike some of
the existing tools, our approach doesn’t require an automated test
suite or executing past versions of the system. It is highly scalable
to systems with millions of lines of code. The developer, who has no
prior knowledge of the code or the bug, reproduces the bug according
to the steps described in the bug database. We evaluated our
approach with bugs from leading open source projects (Eclipse,
Tomcat, Ant). In over 90% of the cases, the developer only has to
examine 10-20 lines of code in order to locate the bug, regardless
of the code base size.

Index Terms: Development Tools; Regression Bugs; Fault
Localization; Debugging.

I. Introduction 

Program evolution and repair are major activities of software
maintenance, which consumes a significant fraction of the total
cost of software production.

During software development, new functionality is introduced,
increasing the complexity of the system. Increased
complexity often results in a reduced ability to estimate the
impact of a code change. For example, mature and successful
software often supports several operating system versions
and configurations. When a developer makes a change, they
are not always aware of all the configurations impacted.

Furthermore, every code change now has to interact with
more and more parts of the existing code, because it needs
to reuse some of its functionality and integrate with other
parts. Often, the older code was not designed with many of
the current requirements in mind, so parts may need to be
rewritten to allow reuse by the new code, or to comply with
non-functional requirements, such as security and performance.

While code reuse is a desired property of a system, changes
to infrastructure code can easily break one of its
many clients (callers), especially when the interface (contract)
is not well defined and the clients hold different assumptions.

Sometimes, the maintainer is not the same person who wrote
the code, which results in a loss of knowledge about system
behaviours and flows. Being aware of the increased risk,
the maintainers may employ a minimal-change strategy, often
referred to as patching. While minimal change (no refactoring)
is low risk in the short term, it is a kind of greedy approach,
often resulting in surprising or hard to understand code.
Another outcome of the no-refactoring strategy is code duplication,
because the maintainer needs to reuse a subset of the existing
code, but doesn’t want to change it. Code duplication in
turn is known to surprise future maintainers (sometimes even
the same maintainer in the future). Typically an issue is fixed
in some of the clones, but not in all of them.

Some studies focus on post-release bugs
or security patches for stable, slow-changing systems. Note
that many regression bugs are introduced and fixed
during the iterations of the internal development process, long
before the maintenance phase.

In general, every code change bears a risk of resulting in
unintended behaviour, regressing functionality that previously
worked as desired. To mitigate the risk, regression testing was
adopted in an attempt to revalidate the old functionality
inherited from the old version. Unfortunately, exhaustive regression
testing is not always cost effective, especially if it
requires manual effort. Other mitigations, such as Code Reviews,
Change Classification and Change Impact Analysis using
dynamic and static techniques, were also studied. Still, a
significant number of regression bugs are being fixed every
day.

Today, locating the change that triggered a regression bug is a
time-consuming task. In the authors’ experience,
the developer often analyzes the bug as if it were
a regular, non-regression bug. Others use the log history of
the source control system to try to identify the offending change,
but may have to traverse a large number of unrelated changes.
Developers can also install previous versions of the system to
reduce the number of changes they must inspect. Installation
and configuration, as well as reproduction of the bug, may
be time consuming unless they are automated, which is not
usually the case for new bugs.

Our goal in this paper is to explore a simple and scalable
method that allows the developer to rank source code changes
with respect to a specific regression bug scenario. The
developer, who has no prior knowledge of the code or the bug,
reproduces the bug according to the steps described in the bug
database. As such, the tool requires only minimal training. Since
we understand that runtime overhead is an inhibiting factor in
the adoption of developer tools, we kept it to a minimal percentage
of the execution time.

In summary, we make the following contributions:

• Highly accurate localization for real world bugs, tested 
on several open source projects from different domains. 

• Simple to implement and reason about. 

• Demonstrates the temporal trace locality of the faulty 
change and the error, by using the novel Execution Order 
ranking. 

• Combines multiple weak filters and ranking methods in 
a novel way, to dramatically reduce the search effort 
required. 

• The requirements for tool operation are the same as for
traditional debugging: execute the bug scenario with the
instrumented build. Automated test suites, special
infrastructure or lengthy training processes are not required.

• Usability: Complements the natural workflow of the
debugging process, without requiring special knowledge.

• Doesn’t require executing past versions of the system,
unlike previous work [18].

• Minimal runtime overhead.

• Locates "not reproducible" bugs, which cannot be
reproduced during debugging due to date-time effects (e.g. leap
year, daylight saving time), thread scheduling or any special
configuration or input from the environment that triggers
the bug.

• Fix localization: When a certain version or branch of the 
software contains a fix that should be ported (merged) to 
another branch / version, our method can locate the fix 
change. 

II. Our Approach 

We will now describe a method to rank source code changes 
that cause or expose a regression bug. Let’s start with a more 
precise definition of the software regression bug that we try 
to localize. 

Software Regression: Software functionality that previously 
worked as desired and stops working, or no longer works as 
expected, given the exact same external environment. Note 
that this definition requires a change in the system to trigger 
the regression bug, because all inputs to the system are held 
constant. 

As a starting point, the developer obtains a diff file between
the last known good version of the software and the current,
buggy version. The diff file is typically obtained from the
revision control system, but can also be obtained by comparing
two source file trees using a diff utility. The diff file format
consists of a set of hunks. A hunk is a consecutive range of
lines added, updated or deleted. After executing the code, each
hunk contains one or more sub-ranges of code that actually
executed. We rank the suspect changes at the granularity of hunk
sub-ranges, which usually means a few consecutive lines. From
now on we will refer to such a sub-range as a change or code change.
Note that this method assumes that one or more of the changes in the
diff is the root cause, or the exposing change, for the bug. Note
also that while the number of changes since the good version
varies, it could reach hundreds or thousands of changes. The
tester or customer who opened the bug often specifies the previous
release (say a year ago) as the good version and does not know
the exact revision that introduced the bug, which increases the
size of the diff file considerably.

We use a combination of dynamic analysis methods and 
heuristics to rank the source code changes. 

A. Executed Changes 

The first filter is Executed Changes: a hunk that was not at
least partially executed in the bug reproducer scenario is not
displayed to the user in the search results.

In a one-year diff of an active large system, this filter, by itself,
can typically reduce the number of changes from thousands
to hundreds.
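
The filter can be sketched as follows. This is an illustration under simplifying assumptions, not the tool's implementation: hunks are given as hypothetical `(start, length)` pairs within a single file, and `executed_lines` stands for the set of source lines that coverage reported as executed.

```python
from typing import Dict, Iterable, List, Set, Tuple


def executed_subranges(hunk_start: int, hunk_length: int,
                       executed_lines: Set[int]) -> List[Tuple[int, int]]:
    """Return the maximal runs (first, last) of executed lines inside a hunk."""
    ranges = []
    run_start = None
    for line in range(hunk_start, hunk_start + hunk_length):
        if line in executed_lines:
            if run_start is None:
                run_start = line
        elif run_start is not None:
            ranges.append((run_start, line - 1))
            run_start = None
    if run_start is not None:
        ranges.append((run_start, hunk_start + hunk_length - 1))
    return ranges


def filter_executed(hunks: Iterable[Tuple[int, int]],
                    executed_lines: Set[int]
                    ) -> Dict[Tuple[int, int], List[Tuple[int, int]]]:
    """Drop hunks with no executed line; map the rest to their sub-ranges."""
    result = {}
    for start, length in hunks:
        subranges = executed_subranges(start, length, executed_lines)
        if subranges:
            result[(start, length)] = subranges
    return result
```

Each surviving sub-range corresponds to what the text calls a change: a few consecutive executed lines inside a hunk.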


B. Execution Order 

Execution Order is a lightweight heuristic that assumes temporal
locality between faults and observed errors. Preliminary
investigation indicated that many of the observed regression
errors are at a short trace-distance from their root-cause faulty
change. Let’s define the executed trace T of a program P as
the ordered sequence of executed basic blocks:

$$T = \langle B_{i_1}, B_{i_2}, \ldots, B_{i_n} \rangle$$

The trace-distance between two elements of the trace is defined as
the difference between their positions in the sequence. For
the trace above:

$$\mathrm{tracedist}(B_{i_j}, B_{i_k}) = |k - j|$$

A hunk H may span several basic blocks. Our method instruments
all basic blocks of a hunk to log its execution. The result
is an ordered sequence of executed hunks (with repetitions):

$$\langle H_{i_1}, H_{i_2}, \ldots \rangle$$


The developer is instructed to save the execution order state as
soon as possible after the error is observed. The tool offers several
ways to accomplish that. For example, if the bug results in
an exception, it is easy to stop in the debugger when the exception
is thrown and then save the execution order. If it is a hang
or any incorrect result in the UI, the developer can click the
’dump’ button right after (see Section II-E). Often it is easy to place
a breakpoint to suspend execution some time after the error
occurred and then save (dump). Although it is important to
dump the state as soon as possible, we didn’t observe high
sensitivity to the dump timing, especially when combined with the
Differential Basic Block Hit ranking method (see next section). We
then employ a simple ranking (that can be combined with other ranking
scores as well): the last hunk logged before the dump is
ranked first (1), the next closest hunk is ranked second, and
so on, according to the trace-distance from the dump point.

The Execution Order method allows us to effectively rank
dozens, or even hundreds of hunks, with the faulty hunk
often near the top of the list. We provide here the intuition behind
this method: today, one of the most helpful pieces of supporting
information that developers use to debug is a stack trace. When
available, the developer browses the stack frames preceding
the error location in order to search for the fault. Execution
Order covers this scenario and more: it also locates a change
even if it is not in the active stack trace, as long as it has a
relatively short trace-distance from the error. Note that while a
stack trace can be extracted mainly for bugs of type crash and hang,
Execution Order is almost always available, because one can
always dump the state after the error occurred.
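
The ranking itself is simple enough to sketch in a few lines. The function below is a hypothetical illustration, assuming the hunk log is available as a sequence of hunk identifiers whose last element is the hunk logged right before the dump.

```python
from typing import Dict, Sequence


def execution_order_rank(hunk_log: Sequence[str]) -> Dict[str, int]:
    """Rank hunks by trace-distance from the dump point (the end of the log).

    The hunk logged last gets rank 1. A hunk that was logged several times
    is ranked once, by its most recent occurrence.
    """
    ranks: Dict[str, int] = {}
    for hunk_id in reversed(hunk_log):
        if hunk_id not in ranks:
            ranks[hunk_id] = len(ranks) + 1
    return ranks
```

For the log H3, H1, H3, H2, H1 this yields H1 first, H2 second and H3 third: although H3 was executed twice, its latest execution is further from the dump point than that of H2.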

C. Differential Basic Block Hit 

There is a problem with Execution Order: if other, non-faulty
changes are executed after the faulty change and before
the state is saved, the top-ranking changes will not be the faulty
ones we are after. During preliminary research, the problem
presented itself mainly in two typical cases: a) changes in a
background task, executing concurrently with the bug scenario,
and b) changes in low-level code that executes often, such
as UI message loop code. If a breakpoint is set right after
the error, the interference is insignificant. But when
the developer uses the UI button to dump the state,
changes continue to execute for another second or so, and
additional ranking or noise filtering is required in order to
achieve top ranks.

As a solution, we further isolate the changes relevant to the 
bug scenario by using Differential Basic Block Hit. 

1) The code is instrumented to collect basic blocks hit 
coverage information. 

2) The developer first executes the code in a scenario that 
doesn’t reproduce the problem and then saves the state 
of the covered basic blocks. This is usually done after 
the system has initialized, before the developer starts 
reproducing the bug. 

3) Then the bug scenario is executed and the state is saved
again (cumulative block coverage + Execution Order
from the previous section).

4) A difference operator is used to find all basic blocks that
were executed in the bug scenario, but not in the first
coverage dump (non-bug scenario).

5) We rank first all changes that have in their vicinity (in 
the same method, max 10 lines distance) a basic block 
from the coverage diff, and all other changes after. Note 
that within this rank, we combine a secondary Execution 
Order rank, described above. 

Note that this is not the same thing as ranking first the
changes that were executed in the bug scenario but not in
the non-bug scenario. Initial investigation revealed that even
if a change was executed in both scenarios, but is in the same
method and within a certain distance of a basic block in the
coverage diff, it should be ranked first, because it may be in the
backward slice of a conditional that has a different value in the
two scenarios, therefore affecting the buggy control flow.
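
The steps above can be sketched as follows. This is a simplified illustration rather than the tool's implementation: basic blocks are encoded as hypothetical `(method, line)` pairs, `Change` and `rank_changes` are names introduced here, and the secondary Execution Order rank is passed in as a dictionary from change id to rank.

```python
from typing import Dict, List, NamedTuple, Set, Tuple

Block = Tuple[str, int]  # (method name, source line); hypothetical encoding


class Change(NamedTuple):
    change_id: str
    method: str
    line: int


MAX_LINE_DISTANCE = 10  # "max 10 lines distance" from the text


def rank_changes(changes: List[Change],
                 baseline_cov: Set[Block],
                 bug_cov: Set[Block],
                 eo_rank: Dict[str, int]) -> List[str]:
    """Rank changes near the coverage diff first, then by Execution Order."""
    # Step 4: blocks executed in the bug scenario but not in the non-bug one.
    diff_blocks = bug_cov - baseline_cov

    def near_diff(change: Change) -> bool:
        # Step 5: same method and at most MAX_LINE_DISTANCE lines away.
        return any(method == change.method
                   and abs(line - change.line) <= MAX_LINE_DISTANCE
                   for method, line in diff_blocks)

    unranked = len(eo_rank) + 1  # changes missing from the hunk log sort last
    ordered = sorted(changes,
                     key=lambda c: (not near_diff(c),
                                    eo_rank.get(c.change_id, unranked)))
    return [c.change_id for c in ordered]
```

Note that `near_diff` does not ask whether the change itself was newly executed; it only asks whether a newly executed block lies close to it in the same method, which matches the distinction drawn in the preceding paragraph.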

Sensitivity is also an issue here: what if the developer
recorded the wrong non-bug scenario? The main risk is
that the faulty change will also enter the non-bug coverage
(instead of only the bug scenario coverage), lowering its rank. In
practice, we didn’t encounter difficulties in this case, because
we only use the non-bug scenario for "background noise
filtering" purposes, so it doesn’t have to be very similar to
the bug scenario, which lowers the risk. Typically, system activity
is recorded so as to include the periodic background tasks and
low-level common code in the non-bug coverage. Often the
non-bug recording doesn’t activate the product feature in which the
bug occurred. For example, for a bug in a search feature
of a product, launch the instrumented system and perform
a few operations, without activating search. This is our non-bug
scenario, which will isolate those changes related only to
search functionality.

This method alone can reduce in some cases the number of 
ranked-first changes to several dozens. In combination with the 
Execution Order, we get the final results. Note that Differential 
Basic Block Hit can be used as a filter or as a ranking method, 
where the changes that occur in methods with difference are 
ranked before those that occur in methods with no difference. 

D. Semantic Textual Affinity 

Textual search is used on a daily basis by developers
(including the authors), usually via the editor Find functionality
or command line tools, such as grep. There are several
well-known problems when using grep-like tools:

1) Lack of ranking: If the search term appears in method
C multiple times, the search result will not be presented
before a hit in method B, where it appears only once. As
another example, a term hit in a method name is not ranked
higher than a hit in a method body.

2) Search terms (or regular expression patterns) must
generate an exact hit, so searching for the term login does not
generate a hit if the code only contains related terms
such as authentication or password.

3) Program structure implying distance between terms is
not utilized. For example, if the search term occurs in
a callee of the method containing the faulty change, and the
callee resides in a different file, no hit is generated by grep.

We investigated and implemented a novel textual affinity
search engine as an additional ranking method, combined with
the dynamic analysis methods described above. The basis
for the Affinity search was laid by [10] and [11], and it relies
on several IR and program structure ranking heuristics. The
developer uses search terms, which may be identical or related
to the actual terms appearing in or near the faulty change.

1) Identifier names in the source code and comments are
split into parts based on known coding standards, such
as CamelCase and underscores, while the original form
of each identifier is preserved as well.

2) Java keywords are treated as stop words, in addition to
a standard English stop word list.

3) For matching search terms that do not appear in the
source code, we use two different methods. For
terms appearing in the WordNet lexicon [11], we
use an adapted version of the CodePsychologists
affinity score algorithm, which determines the score as
a function of the inverse distance in the WordNet taxonomy
tree. The tree represents relations such as synonymy and
hypernymy. See more details in [10].

4) For other terms, which are not covered by WordNet, we
use the Snowball stemmer to determine equivalence.

5) WordNet does not specialize in software engineering
terms, so the relatedness score between two related
terms, such as screen and resolution, is surprisingly
low. To overcome this limitation, we also experimented
with collocation techniques, where terms that often co-occur
in a software engineering corpus are
considered related. Our choice of corpus was Stack
Overflow, where we used the titles and tags of a
subset of the posts. The initial experiment results are
not conclusive.

6) Term weighting is an important part of current IR
technology. We created a weighting scheme as follows:

a) TF-IDF: TF-IDF assigns higher weight to term hits
that appear frequently in a few files and do not
appear frequently in other source files. We have
used a robust variant of TF-IDF adapted to counter
problems of frequency imbalance.

b) Since the targets of the search are changes, a term
hit is assigned higher weight if it is inside an executed
change, compared to hits further away (in lines
of code) from the executed changes.

c) Term hits in the containing method name, its
parameters or the method comments are also assigned
higher weight, as are hits in the containing class name.

7) We have also experimented with a static call graph,
to allow for hits in methods called from the method
containing the change. These are assigned lower weight
as their call graph distance grows. We also limit the
search to 3 levels of caller-callee.

8) Finally, the search rank is combined with the other
methods to form the final rank.
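
To make the first two steps of the list above concrete, here is a minimal sketch of identifier splitting with stop-word filtering. It is an illustration only: the regular expression, the tiny `STOP_WORDS` set and the `split_identifier` function are assumptions introduced here, and a real engine would use complete Java keyword and English stop-word lists.

```python
import re
from typing import List

# A handful of Java keywords and English stop words, for illustration only.
STOP_WORDS = {"public", "private", "static", "void", "return", "new",
              "the", "a", "of", "to", "is"}

# Matches, in order: an acronym followed by a capitalized word ("XML" in
# "XMLFile"), a capitalized or lowercase word, a trailing acronym, digits.
_CAMEL = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+")


def split_identifier(identifier: str) -> List[str]:
    """Split on underscores and CamelCase, keeping the original form too."""
    parts: List[str] = []
    for chunk in identifier.split("_"):
        parts.extend(m.group(0).lower() for m in _CAMEL.finditer(chunk))
    # Preserve the original identifier when it was actually split.
    terms = [identifier.lower()] if len(parts) > 1 else []
    terms.extend(parts)
    return [t for t in terms if t not in STOP_WORDS]
```

For example, `split_identifier("getServletContext")` yields the original form plus the parts get, servlet and context, so a search for servlet can hit the identifier even though grep for the whole word would not.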

We ended up omitting Textual Affinity from the experiments
for the following reasons:

1) In our experiments, the results were satisfactory without
using Textual Affinity.

2) Validation complexity: It is harder to objectively choose
the search term. In fact, we discovered it is only
applicable to a subset of the bugs, those which include relevant
terms in their title or description. The same or related terms
should appear in the vicinity of the faulty change, which
is not always the case. Still, it may be useful, as we plan
to investigate in the future.

3) We still need to enhance and improve the functionality
of this engine.


E. Developer workflow 

To get a better grasp of the ranking methods described and
also review the user interface of the Regression Detective tool,
we will go through an example, step by step. The tool is
implemented as an Eclipse plugin and currently supports Java.

1) The developer starts with the steps needed to execute the bug
scenario. Note that a formal bug report is not required.
In our example the bug is from the Apache Ant project:
Bug 52923: JUnit4 test should not run as JUnit3 if
annotated RunWith(JUnit4)



2) Next, the last known good version is determined.
A diff file is created that includes all changes from
the good version to the current, buggy version. The
developer typically uses a revision control system, such
as Subversion or Git (not required). The diff file is
imported into the IDE using a wizard dialog. Our bug
is a regression from Ant 1.8.1, reported and fixed in Ant
1.8.2.



3) The developer can optionally instrument the changes
to activate Execution Order. When the program is
launched, it is in any case instrumented for basic block
coverage (used for the Executed Changes and Differential
Basic Block Hit methods, described above).


4) Now the reproducer test (attached to the bug) is executed
and the developer clicks the dump button (in the same
toolbar shown in the image above) to save the state of
the Execution Order and the covered basic blocks. Note
that Ant is a command line process, hosting JUnit, which
in turn loads and executes our user-code test (using the
wrong JUnit version). Since the whole run takes about a
second and it may be hard to place a breakpoint inside
the test in this case, the developer can easily add a sleep()
statement to their own test code, allowing enough time
to click the Dump button.

5) After the execution data has been collected, our plugin
displays search results in a view which resembles
the UI of popular search engines, except that for this bug we
don’t need textual search, so we leave the search term
fields empty and simply click the search button.



6) It can be seen in the search view, that the second result is 
the faulty change. When clicked, it is selected in Eclipse 
Java code editor. 

Note that in this specific bug, the Execution Order is sufficient
to localize the bug, and further ranking with Differential Basic
Block Hit or Textual Affinity was not required, since we already
have a ranking of 2. The developer can incrementally choose
to apply further ranking methods, as needed.

The tool is integrated with the workflow of the developer
and adds only a few clicks to the traditional debugger workflow.
After clicking a search result to inspect it in the Java Editor of
Eclipse, she can either fix it or place breakpoints to continue
the investigation. She can also examine the executed code (see
green highlighted lines in the above screenshot) vs. the
non-executed code (in red), which helps to focus on the relevant area.

F. Not reproducible bugs 

None of the change ranking methods described above relies
in any way on actually reproducing the error in the developer
environment. It is sufficient that the faulty change is executed
close to the dump state point. If Differential Basic Block Hit
is used, it may be sufficient that the faulty change is near
instructions that were executed in the bug scenario but not
in the non-bug scenario. Note that by actually reproducing
the bug, the developer gains high confidence that they are
executing the right scenario, but this is not a requirement of the
tool or the method. Even if the order of the changes executed
is slightly different (as long as the above sufficient conditions
are met), the method is still effective.

One of the authors witnessed such a real-world customer
regression bug, which couldn’t be reproduced in the R&D labs.
After several days were invested to no avail, a senior architect
(who was also the original developer of the code) was flown
to the customer site to investigate the bug on a specific
machine, where it could be reproduced. The root cause was
a race condition, introduced by a code change. The faulty
change was always executed close to the error point. It was
determined by manual analysis that the bug could have been easily
detected by our prototype tool, except that it doesn’t support the
runtime and language used. Today, Regression Detective only
supports Java, but support for other runtimes and languages can be
added.

Note that we do not claim that the above methods are
optimal for localizing race conditions, as there is a large body
of work specializing in this subject. There are other common
reasons for not reproducing the bug, while still meeting the
above sufficient conditions, such as a different external
environment: date-time settings, locale settings, interaction with other
software installed only on the reproducing machine, and so
on.

Unfortunately, we didn’t find non-reproducible regression
bugs in the open source repositories used for our evaluation,
but our methods are easy to reason about in order to determine
such a property, as discussed above.

G. Fix localization 

Software development processes often result in the creation
of branches, which represent different versions of the system.
It is sometimes the case that a bug was fixed or a
feature developed in one of the branches, without merging
to all other branches immediately. When the need arises to
port a specific bug fix or functionality to the other branches,
the developer first has to localize the relevant changes in the
source branch. It is often the case that the merge includes
changes not documented in known bugs or specifications.
Since our tool doesn’t assume the changes are necessarily
related to bugs, it can be used to find any type of change
(fix, new functionality). The developer chooses a scenario that
executes the changes and uses our ranking methods to locate
the change. Another common scenario is fix reuse. The authors
have recently encountered a bug which is not a regression. Its
scenario and control flow are very similar to those of another,
already fixed bug. The developer assigned to the new bug was not
aware of the previously fixed bug with the similar scenario,
in the same area of the code. With our tool, she can localize
the existing fix code in the new bug scenario, and that code can be
reused to handle the new bug as well. In our example, the already
fixed bug involved rewriting a certain absolute URL. The new bug
was similar, but with a relative URL. Why is this considered a
common case? Because bugs are often reported and fixed
for a certain scenario, while other related defective scenarios
are not fully considered. Sometimes this partial fix results in
a regression. In other cases, the fix only handles the exact
reported scenario and should be extended to handle the
related, similar defective scenarios as well.

| Name             | Description                                                                                 | LOC  |
|------------------|---------------------------------------------------------------------------------------------|------|
| Apache Tomcat    | Java application server, widely adopted for large-scale, mission-critical web applications | 1M   |
| Eclipse JDT Core | The Java tooling infrastructure of the Eclipse Java IDE                                     | 1.5M |
| Apache Ant       | Build system command line tool                                                              | 200K |

TABLE I: Open source projects used for evaluation

III. Implementation 

We implemented Regression Detective as an Eclipse plugin
for the Java programming language. Java is a popular
programming language with several leading open source
projects we could use for evaluation. Eclipse is a popular
IDE with a plugin extensibility architecture, used in much of
the related academic work. We used a modified version of
Emma and the EclEmma Eclipse plugin as the basis
for our coverage collector and UI, and added BugDel
with Javassist for hunk instrumentation. The textual
affinity search engine was based on earlier code, with numerous
modifications. The Eclipse JDT model and its compare plugin
allowed us to reduce the amount of code we had to write.

Regression Detective uses two types of instrumentation. The
first is Emma coverage for basic blocks, which sets a boolean
flag to true when the block is executed. The second is
hunk instrumentation, which logs the hunk id in an in-memory
circular buffer whenever the hunk is executed.
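
The two kinds of runtime state can be pictured with the following minimal sketch. It is written in Python for brevity and is not the Emma or BugDel code; the class names `CoverageFlags` and `HunkLog` and their methods are hypothetical.

```python
from typing import List


class CoverageFlags:
    """One boolean per basic block, set to True when the block executes."""

    def __init__(self, block_count: int) -> None:
        self.hit = [False] * block_count

    def mark(self, block_id: int) -> None:
        self.hit[block_id] = True


class HunkLog:
    """Fixed-size circular buffer of hunk ids; old entries are overwritten."""

    def __init__(self, capacity: int) -> None:
        self._buf = [0] * capacity
        self._count = 0  # total number of ids ever logged

    def log(self, hunk_id: int) -> None:
        self._buf[self._count % len(self._buf)] = hunk_id
        self._count += 1

    def dump(self) -> List[int]:
        """Return the retained ids, oldest first."""
        capacity = len(self._buf)
        if self._count <= capacity:
            return self._buf[:self._count]
        head = self._count % capacity  # position of the oldest entry
        return self._buf[head:] + self._buf[:head]
```

A circular buffer keeps memory use bounded however long the program runs, at the cost of forgetting the oldest entries; since Execution Order ranks by proximity to the dump point, the most recent entries are the ones that matter.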

The Search results view was created with HTML5 and JavaScript
technologies using AngularJS [20], and is an interesting
example of integrating Web technologies in a desktop application
(Eclipse). Such an approach will enable our team to port the
project to different IDEs in the future.

IV. Experiment

One of our main objectives was to demonstrate localization
of real-world regression bugs. Therefore, we selected several
leading open source projects representing different application
domains, such as Web application server, IDE compiler,
program manipulation library and command line tools (see
Table I). Note that the number of lines of code refers to the versions
actually used in the evaluation, as some of these projects have
grown a lot larger over the years.

Next, we randomly selected regression bugs from the
projects’ bug databases. Many of the regression bugs are not
clearly marked as such, which limited the number of bugs
that could participate in the evaluation.

For each bug, we set up a development environment of the
source code and build system in Eclipse for the version in which
the bug was reported and fixed. We encountered technical
difficulties with some of the older bugs (for example a bug in
Apache Tomcat 5.0.1), which forced us to omit several bugs
whose build dependencies were hard to acquire or whose build
instructions were missing.

Bug reports often include the regression and fix location in
the description or the comments. The information extracted before
the experiment consists of the title and the part of the description
that doesn’t include the fix location. It is important to simulate
the common case, where the developer doesn’t already know
the location of the bug.

It is interesting to read some of the discussions in the
comments, which indicate that it sometimes requires days of
collaboration between several developers, users and testers to
localize such regressions.

We reproduced each bug, saving dumps of the Execution Order
and often also coverage information for Differential Basic
Block Hit. The precise point for saving dumps was selected per
bug and is detailed in Table II.

For ground truth we used the fix associated with the bug.
Many of the bugs have a patch with the fix attached, or a
fix revision in the revision control system associated with the
bug. We had to disqualify bugs which didn’t have a clear fix
location.

V. Results 

As can be seen in the Evaluation results table, the ranking
achieved using a combination of Execution Order and
Differential Basic Block Hit is 1-6.

During the experiments, we observed that if one or more of
the faulty hunks are ranked in the top 10 search result entries,
the developer can quickly browse and locate the offending
change in a matter of minutes.

Note that for the first few bugs (in the Ant project) only
Execution Order was used, mainly because the developer
didn’t need Differential Basic Block Hit, as the ranking was
already high.

| Bug    | From    | Description | Rank (EO) | Rank (EO+D) | Dump Point |
|--------|---------|-------------|-----------|-------------|------------|
| 52923  | Ant     | JUnit4 test should not run as JUnit3 if annotated RunWith(JUnit4) | 2 | - | Added delay inside the reproducer test case (Example.java) to sleep for few seconds to allow clicking dump button after the (wrong) test method was called |
| 50007  | Ant     | Taskdef classpath breaks with directory containing an exclamation mark (!) | 4 | - | When exception is thrown |
| 50953  | Ant     | Tasks defined with an antlib within a jar in succeeds with Ant 1.7.1 and fails with 1.8.2 | 1 | - | When exception is thrown |
| 51387  | Ant     | Performance regression in ant task to launch external processes | 2 | - | Used VisualVM Profiler to automatically locate the hotspots, then paused in debugger |
| 55227  | Ant     | JUnit 3 tests -> run with JUnit 4 -> put @Ignore annotation -> ant incorrectly executes the @Ignored test | 2 | - | Added delay inside the reproducer test case (Example.java) to sleep for few seconds to allow clicking dump button after the (wrong) test method was called |
| 252887 | Eclipse | Pressing F1 causes editor properties to disappear | 10 | 6 | Breakpoint in main keyboard handler after pressed F1 |
| 378390 | Eclipse | Java Search regression for references of a method | 1 | 1 | Recorded Eclipse Java Editor with another file + Open Search dialog, but didn’t execute search -> Dump -> reproduce 0 results Java Search -> Dump again |
| 18201  | Tomcat  | getReader() does not throw UnsupportedEncodingException when bogus charset is used | 1 | 1 | Breakpoint after getReader (which is called directly from user servlet test reproducer code) |
| 44405  | Tomcat  | Regression with tomcat 6.0.16, NullPointerException on getServletContext().getResourceAsStream() | 13 | 1 | Recorded several requests -> Dump -> reproduce NullPointerException -> Dump |

TABLE II: Evaluation results

EO: Execution Order
EO+D: Execution Order + Differential Basic Block Hit

VI. Runtime overhead

When developers use debugging tools in a lab environment,
they can usually tolerate a minor slowdown, as long as their
workflow is not disrupted. We have measured slowdowns of
1-10 percent, depending on the test:

1) The overhead of change instrumentation depends on the number
of changes and their location, but generally has a very
low, almost unmeasurable impact. It inserts a function
call that does very little work: it writes a 64-bit long
integer to a circular memory buffer. This is less work than
most function calls in the system do, and such calls are usually
quite rare in the execution trace (even compared to standard
logging).

2) We believe we can apply the change instrumentation 
even on production servers, where the change logs 
are taken for further analysis in the lab. There, the 
developer can execute the same scenario with more 
instrumentation and discover the faulty change, even if 
the bug doesn’t reproduce in the lab. 

3) Much of the slowdown is in the Basic Blocks (Emma) 
instrumentation, as with any method that uses granular 
coverage for testing or fault localization. Our prototype 
is not at all optimized: We believe we could obtain 
similar results using only function entry logging, in 
addition to changes logging, which would reduce the 
overhead considerably. 

4) Using our prototype with many real-world bugs, our 
subjective impression is that the developer can still 
debug at normal speed, with a somewhat longer startup 
time for large projects (such as Eclipse). The slowdown 
is hardly noticeable with shorter scenarios, such as the Ant 
command-line reproducer samples taken from the Ant 
bug database. 
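
The change-logging step described in item 1 can be sketched as follows. This is a minimal illustration in Python, not the prototype's code (which targets Java systems); the names `ChangeLog`, `record` and `dump` are hypothetical, and the use of plain integers as change identifiers is an assumption. 

```python
from array import array
from typing import List


class ChangeLog:
    """Fixed-size ring buffer of 64-bit change identifiers (hypothetical design)."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        # Type code "q" stores signed 64-bit integers in a compact array.
        self._slots = array("q", [0] * capacity)
        self._capacity = capacity
        self._written = 0  # total number of records ever written

    def record(self, change_id: int) -> None:
        """Called by the instrumentation hook; overwrites the oldest slot when full."""
        self._slots[self._written % self._capacity] = change_id
        self._written += 1

    def dump(self) -> List[int]:
        """Return the surviving records, oldest first."""
        count = min(self._written, self._capacity)
        start = self._written - count
        return [self._slots[i % self._capacity] for i in range(start, self._written)]
```

Because the buffer has a fixed size, each call costs one array store and one increment, and only the most recent records survive until the dump. With a capacity of 4, recording the identifiers 101 to 106 leaves 103, 104, 105 and 106 in the dump. 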

VII. Threats to Validity 

Our study, like any of this nature, has limitations which may 
impact the validity of our findings. We now proceed to identify 
some of these limitations, their impact, and how we attempted 
to address them. The primary threat to external validity is 
the generalizability of our results, acquired from the 9 evaluated 
bugs in 3 Java-based systems. Our temporal trace distance 
assumption may not hold, for example, for initialization bugs. 
In these bugs the faulty change is executed at system startup, 
but the error occurs much later. In response, we intend 
to introduce slicing and dataflow-based methods to detect 
a data dependency between the error location and the faulty 
change, provided the bug scenario exhibits such a location (e.g., 
an exception throw location). Our Textual Affinity search may 
also help with such cases. Note that if the faulty change 
occurs in a specific scenario, long before the error, and doesn’t 
occur in common scenarios, such as initialization, we can still 
localize it using the Differential Basic Block Hit method. 
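
The differential filter mentioned above can be illustrated with a small sketch. It assumes that each scenario yields a collection of basic-block identifiers and that each change is mapped to the block containing it; the function names and the string identifiers are hypothetical and are not taken from the prototype. 

```python
def differential_block_hit(bug_blocks, baseline_blocks):
    """Blocks hit in the bug scenario but not in the baseline (non-bug) scenario."""
    return set(bug_blocks) - set(baseline_blocks)


def filter_changes(change_to_block, bug_blocks, baseline_blocks):
    """Keep only the changes located in blocks unique to the bug scenario."""
    unique = differential_block_hit(bug_blocks, baseline_blocks)
    return [change for change, block in change_to_block.items() if block in unique]
```

For example, with bug-scenario blocks `["init", "parse", "resolve", "throw"]`, baseline blocks `["init", "parse"]`, and changes `{"c1": "init", "c2": "resolve", "c3": "render"}`, only `c2` survives: `c1` sits in a block that both scenarios execute, and `c3` was never executed. 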

VIII. Related Work 

Several studies compare behaviors of faulty and correct 
program versions on the same inputs, using various spectra, 
such as path and value, to localize regressions [?]. 

In our work, we used block hit spectra only as a weak 
rank/filter, to isolate the area in the code relevant to the 
regression. The actual offending lines are usually detected by 
ranking the program changes using execution order, which is not 
used by the above. The above methods also require instrumentation 
and execution of the faulty version (as in our work), 
plus the previous correct version, which our work doesn’t 
require. While in some cases it is easy enough to obtain 
the previous versions, instrument and execute them, there 
are often changing configurations and external dependencies, 
such as remote services and database schemas, that make it 
difficult to set up code that is several months old for debugging. 
In addition, some of our evaluation subjects are much larger 
(Tomcat, Eclipse JDT Core), in terms of the number of changes 
to rank, compared to most of the subject projects in the above 
work. 

The CodePsychologist, which was previously developed in 
our group [?], is a tool which assists the programmer in locating 
source code segments that cause a given regression bug. The 
CodePsychologist uses affinity ranking to estimate how close 
a segment of code is to a given test case. Affinity between 
groups of words is calculated based on the semantic similarity 
between pairs of words from each group, measured as the 
inverse path length in a WordNet taxonomy. While in our 
current work we did also experiment with text-based semantic 
similarity, we discovered that a) the bug descriptions do not 
always contain the search terms needed to locate the offending 
code change (test case descriptions were used for search 
in [?]), and b) dynamic analysis by itself is accurate 
enough in most cases. Semantic search was successfully used 
for feature and bug localization in [?], experimenting 
with a range of IR techniques, such as LSI. 
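
To make the inverse-path-length measure concrete, the sketch below computes it over a toy taxonomy. It is only an illustration: the `PARENT` table is a hypothetical stand-in for WordNet, the `+1` in the denominator is an assumption that keeps identical words at similarity 1.0, and averaging over all word pairs is an assumed aggregation that may differ from the one used by the CodePsychologist. 

```python
from itertools import product

# Toy child -> parent taxonomy (hypothetical; a real system would query WordNet).
PARENT = {
    "editor": "window",
    "dialog": "window",
    "window": "widget",
    "button": "widget",
    "widget": "entity",
    "file": "entity",
}


def ancestors(word):
    """Return the chain [word, parent, grandparent, ...] up to the root."""
    chain = [word]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def path_length(a, b):
    """Number of taxonomy edges between a and b via their lowest common ancestor."""
    depth_in_b = {node: steps for steps, node in enumerate(ancestors(b))}
    for steps_a, node in enumerate(ancestors(a)):
        if node in depth_in_b:
            return steps_a + depth_in_b[node]
    raise ValueError(f"{a!r} and {b!r} share no ancestor")


def similarity(a, b):
    """Inverse path length; the +1 (an assumption) avoids division by zero."""
    return 1.0 / (1 + path_length(a, b))


def affinity(group_a, group_b):
    """Mean pairwise similarity between two groups of words (assumed aggregation)."""
    pairs = list(product(group_a, group_b))
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)
```

In this toy taxonomy "editor" and "dialog" are two edges apart (through "window"), giving a similarity of 1/3, while "editor" and "file" are four edges apart, giving 1/5. 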

Darwin uses symbolic execution to automatically synthesize 
a new input that (a) is very similar to the failing input, 
and (b) does not fail. It then identifies code fragments where 
the control flow diverges. Darwin uses powerful techniques 
that are accurate and can even localize non-regression bugs. 
Unlike our method, these techniques incur significant runtime 
overhead, are a lot harder to implement, and require reproduction 
of the bug and a reference system where the bug doesn’t 
reproduce. 

RPRISM [?] localizes regression faults by trace differencing, 
using two versions of the program (good, buggy) and 
three comparisons: the correct scenario in the two versions (low 
rank); the regression scenario, which is correct in the good 
version and generates an error in the buggy version (high rank); 
and the diffs occurring between the correct and buggy scenarios 
on the new version only, which is similar to the diff we use 
in RegressionDetective. It aligns and compares ordered traces 
per thread, method and object instance. Note that it doesn’t 
exploit temporal trace distance, as our work does. 

Software reconnaissance [?] is an early work that popularized 
using coverage differences to locate features in code 
using two or more scenarios, some executing the feature code 
and some not. In our work, we use a similar method to localize 
the feature (or system module) containing the regression fault, 
filtering out irrelevant changes (noise). 

Delta Debugging [26] is a divide-and-conquer greedy algorithm 
that systematically searches for failure-inducing changes 
by patching a subset of changes and observing execution 
results. It is capable of locating regression bugs, even in 
non-source-code changes (such as a configuration change), and is 
fully automated. On the other hand, it requires an automated 
test, which may be hard to write for UI applications, and 
an automated way to build and run the system in every past 
revision, which is non-trivial in many projects and may take 
hours (as explained above). As a greedy algorithm, it may get 
stuck in a local minimum, outputting sub-optimal results. Unlike 
our method, it can only handle reproducible bugs. A recent study 
[?] evaluated delta debugging using real-world regression 
bugs of Unix command-line tools and concluded that 
two-thirds of the reported changes were related to the bug. 
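
For readers unfamiliar with the technique, the following is a simplified sketch in the spirit of the ddmin minimization loop; it is not the algorithm of any specific tool cited here. The `fails` oracle is hypothetical: it stands for "apply this subset of changes to the good version, build, run the automated test, and report whether the failure appears", which is exactly the expensive step discussed above. Change identifiers are assumed to be unique. 

```python
def ddmin(changes, fails):
    """Shrink `changes` to a smaller subset for which `fails` still returns True."""
    changes = list(changes)
    if not fails(changes):
        raise ValueError("the full change set must reproduce the failure")
    n = 2  # current granularity: number of chunks to split into
    while len(changes) >= 2:
        chunk = len(changes) // n
        subsets = [changes[i:i + chunk] for i in range(0, len(changes), chunk)]
        reduced = False
        # Step 1: try each chunk on its own.
        for subset in subsets:
            if fails(subset):
                changes, n, reduced = subset, 2, True
                break
        # Step 2: try removing each chunk (test its complement).
        if not reduced:
            for subset in subsets:
                complement = [c for c in changes if c not in subset]
                if fails(complement):
                    changes, n, reduced = complement, max(n - 1, 2), True
                    break
        # Step 3: refine the granularity, or stop at single-change chunks.
        if not reduced:
            if n >= len(changes):
                break
            n = min(len(changes), n * 2)
    return changes
```

If the failure needs both change 3 and change 7 out of changes 1 to 8, calling `ddmin(range(1, 9), lambda s: {3, 7} <= set(s))` narrows the set down to `[3, 7]`. Each call to the oracle corresponds to a full patch, build and test cycle, which is why the cost of running past revisions dominates in practice. 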

RADAR [?] localizes regression faults by constructing 
models of the good version and applying them to the 
trace of the new (bad) version, to detect model violations. 
It instruments only statements that occur unaltered in both 
the good and bad versions of a modified function, plus its 
callers and callees. The data collected consist of Daikon [?] 
invariants and FSAs that capture intra-function control flow. 
Unlike our tool, RADAR requires two versions as well as a 
test suite of passing and failing tests. In the new version, 
only failing test cases are instrumented. Detected anomalies 
are reported in the order they are observed, to allow easy 
understanding of the sequence of events. This is also very different 
from our approach, which reports the suspected changes closest 
to the observed error first. 

FILL [30] enhances FaultTracer [?] using a novel form 
of injected program edits, borrowed from Mutation Testing 
techniques. It assumes a test suite with failing and passing tests, 
an old version, and a new, buggy version of the code. It injects 
artificial mutants into the old version and reruns the test suite. If 
there is a correlation between the test results of the old code 
plus mutants and the new code (with bugs), it maps the mutants 
to real changes to boost their suspiciousness rank. 

Fault localization using dynamic slicing and change impact 
analysis [?] uses a combination of backward and forward 
dynamic slicing, from an incorrect test output value and from 
the change, to filter out statements which do not affect the final 
result or couldn’t be affected by the change. It experiments 
with several methods that are accurate but have high runtime 
overhead. It also assumes the presence of an incorrect output 
value to slice from. 

IX. Conclusions and Future work 

We introduced a method to localize regression bugs that is 
not automatic, but requires very little from the developer, in 
comparison with the detailed know-how required to localize 
a bug with a traditional debugger. We assume that given a 
bug report, the developer should be able to reproduce the 
bug, or at least execute the bug scenario without reproducing 
it, exercising the faulty change. The developer also has to 
click the "dump" button twice, before and right after the error 
is observed. In our evaluation, it only took several minutes 
to obtain a ranked search result for a given bug. Our main 
result is that combining temporal locality between faults and 
observed errors with program execution differences between 
the regression scenario and a non-bug scenario can help the 
developer rank the faulty changes. We also created a Textual 
Affinity search engine as another ranking method, but ended 
up omitting it from the experiments. 
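
The temporal-locality idea summarized above can be sketched as a simple ranking over the logged trace. This is an illustration under stated assumptions, not the prototype's ranking code: the trace is assumed to list change identifiers in execution order, ending at the second dump, and the function and parameter names are hypothetical. 

```python
def rank_by_execution_order(trace, keep=None):
    """Rank change ids so that those executed closest to the dump point come first.

    trace: change ids in execution order, oldest first, ending at the dump point.
    keep:  optional set of change ids that survived a differential filter.
    """
    ranked = []
    seen = set()
    # Walk backwards from the dump point; the first time a change is met
    # corresponds to its most recent execution before the error was observed.
    for change_id in reversed(trace):
        if change_id in seen:
            continue
        seen.add(change_id)
        if keep is None or change_id in keep:
            ranked.append(change_id)
    return ranked
```

For the trace `["c1", "c2", "c3", "c2", "c4"]` the ranking is `["c4", "c2", "c3", "c1"]`; restricting it with `keep={"c1", "c3"}` yields `["c3", "c1"]`, which shows how an execution-difference filter can move a change up the list. 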


We view Delta Debugging as potentially complementary to 
our approach: the developer can benefit from experimenting 
with our top-ranked suspected changes, and perform some 
automatic or semi-automatic patching to further isolate the 
faulty change. The iregression project [32] filters out 
non-executing changes as a preprocessing step for its Delta 
Debugging algorithm variant. 

Our current method only handles changes to executable 
code. For example, a faulty change to a configuration file is not 
taken into account. We plan to investigate scalable dynamic 
data-flow analysis methods to alleviate this problem. 

Refactoring changes, such as a local variable rename, rarely 
result in a regression bug. Work on identifying refactoring 
changes [?] can be used to improve our ranking heuristics. 